In this paper, we consider some long-standing problems in communication systems with access to noisy feedback.
We introduce a new notion, the residual directed information, to capture the effective information flow (i.e. mutual
information between the message and the channel outputs) in the forward channel. In light of this new concept, we
investigate discrete memoryless channels (DMC) with noisy feedback and prove that the noisy feedback capacity is
not achievable by using any typical closed-loop encoder (non-trivially taking feedback information to produce channel
inputs). We then show that the residual directed information can be used to characterize the capacity of channels with
noisy feedback. Finally, we provide computable bounds on the noisy feedback capacity, which are characterized by
the causal conditional directed information.
I. Introduction

The theory of feedback has been well studied for control systems but only partially investigated for communication
systems. So far, a large body of work has looked at communication channels with perfect feedback and
obtained many notable results; see [4], [5], [10] and references therein. As an illustration, it
is known that perfect feedback improves the error exponent and reduces the coding complexity [11]. For channels
with memory, using perfect feedback can increase the capacity compared with the non-feedback case. However,
only a few papers have studied channels with noisy feedback, and many challenging problems are still open. Namely,
how does noisy feedback affect the transmission rate in forward communication channels? Is noisy feedback helpful
in improving the decoding error exponent or reducing the encoding complexity? More generally, is feedback beneficial
for communication even though it is noisy? These questions are difficult because noisy feedback induces a loss of
coordination between the transmitter and the receiver. We can classify the results in the literature into two main
categories. The first category studies the usefulness of noisy feedback by investigating reliability functions and
error exponents. [12] shows that noisy feedback can improve the communication reliability by specifying a
variable-length coding strategy. [13] derives upper and lower bounds on the reliability function of the additive
white Gaussian noise channel with additive white Gaussian noise feedback. [14] considers a binary symmetric
channel with a binary symmetric feedback link and shows that the achievable error exponent is improved under
certain conditions. The second category focuses on the derivation of coding schemes, mostly for additive Gaussian
channels with noisy feedback, based on the well-known Schalkwijk-Kailath scheme. We refer interested readers
to [15], [16], [17], [18], [19] for details.

Instead of concentrating on specific aspects or channels, in this paper we study the noisy feedback problem in
generality. We first focus on the effective information flow through channels with noisy feedback. We introduce
a new concept, the residual directed information, which exactly quantifies the effective information flow through
the channel and provides us with a clear view of the information flow in the noisy feedback channel. In light of
this new concept, we show that the directed information defined by Massey [6], which is useful in the perfect
feedback case, fails in noisy feedback channels. Furthermore, we investigate the DMC with typical
noisy feedback (Definition 7) and prove that the capacity is not achievable by using any typical closed-loop encoder
(Definition 13). In other words, no encoder that typically (to be made more precise in the paper) uses the feedback
information can achieve the capacity. This negative result is due to the fact that, when typically using noisy feedback,
we need to sacrifice a certain rate for signaling in order to rebuild the cooperation of the transmitter and receiver such
that the message can be recovered with arbitrarily small probability of error. Next, we give a general channel coding
theorem in terms of the residual directed information for channels with noisy feedback, which is an extension of
[20]. The main idea is to convert the channel coding problem with noisy feedback into an equivalent channel coding
problem without feedback by considering code-functions instead of code-words [21], [9]. In fact, code-functions
can be treated as a generalization of code-words. By explicitly relating code-function distributions and channel input
distributions, we convert a mutual information optimization problem over code-function distributions into a residual
directed information optimization problem over channel input distributions. Although the theoretical result is in
the form of an optimization problem, computing the optimal solution is not feasible. We then turn to investigate
computable bounds which are characterized by the causal conditional directed information. Since this new form
is a natural generalization of the directed information, the computation is amenable to the dynamic programming
approach proposed by Tatikonda and Mitter [9] for the perfect feedback capacity problem.

The main contributions of this paper can be summarized as follows: 1). We propose a new information theoretic 
concept, the residual directed information, to identify and capture the effective information flow in communication 
channels with noisy feedback and then analyze the information flow in the forward channel. 2). We prove that, 
for DMC with typical noisy feedback, no capacity-achieving closed-loop encoding strategy exists under certain 
reasonable conditions. 3). We show a general noisy feedback channel coding theorem in terms of the residual 
directed information. 4). We propose computable bounds on the noisy feedback capacity, which are characterized 
by the causal conditional directed information. 

Throughout the paper, capital letters $X, Y, Z, \ldots$ will represent random variables and lower case letters $x, y, z, \ldots$
will represent particular realizations. We use $x^n$ to represent the sequence $(x_1, x_2, \ldots, x_n)$ and $x^0 = \emptyset$. We use
log to represent logarithm base 2.

II. Technical Preliminaries 

In this section, we review and give some important definitions of probability theory and information theory, 
which are used throughout the paper. We begin with the following assumption. 

Assumption 1: Every random variable considered throughout the paper takes values in a finite set (i.e. $\mathcal{X}, \mathcal{Y}, \mathcal{Z}, \ldots$) with
the power set $\sigma$-algebra.

Although we restrict our exposition to finite alphabets, most of the results in this paper can be extended to the 
case of any abstract set (i.e. countably infinite or continuous alphabets ). 
Definition 1: [22] (Entropy) The entropy $H(X)$ of a discrete random variable $X$ is defined by

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

We have the following properties of entropy.

(P1) $H(X) \ge 0$.

(P2) $H(X, Y) = H(X) + H(Y|X)$.

(P3) $H(X) \le \log |\mathcal{X}|$, where $|\mathcal{X}|$ denotes the cardinality of the finite set $\mathcal{X}$, with equality if and only if $X$ has
a uniform distribution.



Definition 2: (Mutual Information and Its Density) Consider two random variables $X$ and $Y$ with a joint
probability mass function $p(X, Y)$ and marginal probability mass functions $p(X)$ and $p(Y)$. The mutual information
$I(X; Y)$ is defined by

$$I(X;Y) = E_{p(X,Y)} \log \frac{p(X,Y)}{p(X)p(Y)}$$

and the mutual information density is defined by

$$i(X;Y) = \log \frac{p(X,Y)}{p(X)p(Y)}$$

We present three properties of mutual information which will be used later.

(P4) $I(X; Y) = H(X) - H(X|Y)$.

(P5) $I((X, Y); Z|U) = I(Y; Z|U) + I(X; Z|(Y, U))$.

(P6) $I(X; Y|Z) = H(Y|Z) - H(Y|X, Z)$.

(P4) shows the relationship between mutual information and entropy. (P5) is Kolmogorov's formula [23].
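As a small numerical illustration of Definitions 1 and 2, the following sketch evaluates the entropy and the mutual information of a joint distribution and checks property (P4). The joint distribution (a uniform binary input observed through a binary symmetric channel with crossover probability 0.1) and all function names are our own assumptions for illustration; they do not appear elsewhere in the paper.

```python
import math

def entropy(pmf):
    """Entropy in bits of a pmf given as a dict: outcome -> probability."""
    return -sum(q * math.log2(q) for q in pmf.values() if q > 0)

def marginals(joint):
    """Marginal pmfs p(x) and p(y) of a joint pmf given as dict: (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), q in joint.items():
        px[x] = px.get(x, 0.0) + q
        py[y] = py.get(y, 0.0) + q
    return px, py

def mutual_information(joint):
    """I(X;Y) = E log [ p(X,Y) / (p(X) p(Y)) ], in bits (Definition 2)."""
    px, py = marginals(joint)
    return sum(q * math.log2(q / (px[x] * py[y]))
               for (x, y), q in joint.items() if q > 0)

# Assumed toy example: uniform input bit, binary symmetric channel with crossover eps.
eps = 0.1
joint = {(0, 0): 0.5 * (1 - eps), (0, 1): 0.5 * eps,
         (1, 0): 0.5 * eps, (1, 1): 0.5 * (1 - eps)}

px, py = marginals(joint)
h_x = entropy(px)
h_x_given_y = entropy(joint) - entropy(py)  # H(X|Y) = H(X,Y) - H(Y), by (P2)
i_xy = mutual_information(joint)

print(f"H(X) = {h_x:.4f}, H(X|Y) = {h_x_given_y:.4f}, I(X;Y) = {i_xy:.4f}")
```

Here the mutual information computed from the definition agrees with $H(X) - H(X|Y)$ up to floating-point error, as (P4) states.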
Now, we introduce a notion of causal conditional probability with respect to a time ordering of random variables. 

Definition 3: (Causal Conditional Probability) Given a time ordering of random variables $(X^n, Y^n)$

$$X_1, Y_1, X_2, Y_2, \ldots, X_n, Y_n \qquad (1)$$

where $X^n \in \mathcal{X}^n$ and $Y^n \in \mathcal{Y}^n$, the causal conditional probability is defined by the following expression

$$\vec{p}(x^n|y^n) = \prod_{i=1}^{n} p(x_i|x^{i-1}, y^{i-1}) \quad \text{and} \quad \vec{p}(y^n|x^n) = \prod_{i=1}^{n} p(y_i|x^{i}, y^{i-1}).$$



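Next, we present the definition of directed information with respect to the time ordering (1).

Definition 4: (Directed Information and Its Density) Given a time ordering of random variables $(X^n, Y^n)$ as in (1),
where $X^n \in \mathcal{X}^n$ and $Y^n \in \mathcal{Y}^n$, the directed information from a sequence $X^n$ to a sequence $Y^n$ is defined by

$$I(X^n \to Y^n) = E_{p(X^n,Y^n)} \log \frac{\vec{p}(Y^n|X^n)}{p(Y^n)}$$

and the directed information density is defined by

$$i(X^n \to Y^n) = \log \frac{\vec{p}(Y^n|X^n)}{p(Y^n)}$$

Note that Massey's definition of directed information [6] can be easily recovered from the above definition:

$$
\begin{aligned}
I(X^n \to Y^n) &= E_{p(X^n,Y^n)} \log \frac{\vec{p}(Y^n|X^n)}{p(Y^n)} \\
&= \sum_{x^n \in \mathcal{X}^n,\, y^n \in \mathcal{Y}^n} p(x^n, y^n) \log \frac{\vec{p}(y^n|x^n)}{p(y^n)} \\
&= \sum_{x^n \in \mathcal{X}^n,\, y^n \in \mathcal{Y}^n} p(x^n, y^n) \log \prod_{i=1}^{n} \frac{p(y_i|x^i, y^{i-1})}{p(y_i|y^{i-1})} \\
&= \sum_{i=1}^{n} \sum_{x^n,\, y^n} p(x^n, y^n) \log \frac{p(y_i|x^i, y^{i-1})}{p(y_i|y^{i-1})} \\
&= \sum_{i=1}^{n} \sum_{x^i,\, y^i} p(x^i, y^i) \log \frac{p(y_i|x^i, y^{i-1})}{p(y_i|y^{i-1})} \\
&= \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1})
\end{aligned}
$$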
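To make the sum formula $I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1})$ concrete, the sketch below evaluates it for $n = 2$ on a toy system and compares it with the mutual information $I(X^2; Y^2)$. The system is purely an assumption for illustration: a binary symmetric channel with crossover probability 0.1, and a hypothetical encoder that sets $X_2 = Y_1$ (perfect feedback, no message). The helper names are ours.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(p, idx):
    """Joint entropy (bits) of the coordinates idx of a joint pmf p: tuple -> prob."""
    marg = defaultdict(float)
    for outcome, prob in p.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def cond_mi(p, a, b, c=()):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), coordinates given by index lists."""
    a, b, c = list(a), list(b), list(c)
    return (entropy(p, a + c) + entropy(p, b + c)
            - entropy(p, a + b + c) - entropy(p, c))

# Assumed toy system, coordinates ordered as (X1, Y1, X2, Y2).
eps = 0.1
joint = defaultdict(float)
for x1, n1, n2 in product((0, 1), repeat=3):
    y1 = x1 ^ n1          # first channel use
    x2 = y1               # hypothetical encoder: feed the previous output back in
    y2 = x2 ^ n2          # second channel use
    prob = 0.5 * (eps if n1 else 1 - eps) * (eps if n2 else 1 - eps)
    joint[(x1, y1, x2, y2)] += prob

X1, Y1, X2, Y2 = 0, 1, 2, 3
directed = cond_mi(joint, [X1], [Y1]) + cond_mi(joint, [X1, X2], [Y2], [Y1])
mutual = cond_mi(joint, [X1, X2], [Y1, Y2])

print(f"I(X^2 -> Y^2) = {directed:.4f} bits, I(X^2; Y^2) = {mutual:.4f} bits")
```

In this example the mutual information also counts the dependence of $X_2$ on $Y_1$ created by the feedback, so it exceeds the directed information, which only counts the flow from inputs to outputs.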

We refer the interested readers to [9] for the definition of directed information for an arbitrary time ordering
of random variables. Next, we extend the definition of directed information to the causal conditional directed
information as follows.

Definition 5: (Causal Conditional Directed Information and Its Density) Given a time ordering of random
variables $(X^n, Y^n, Z^n)$

$$X_1, Y_1, Z_1, X_2, Y_2, Z_2, \ldots, X_n, Y_n, Z_n \qquad (2)$$

where $X^n \in \mathcal{X}^n$, $Y^n \in \mathcal{Y}^n$ and $Z^n \in \mathcal{Z}^n$, the directed information from a sequence $X^n$ to a sequence $Y^n$ causally
conditioning on $Z^n$ is defined by

$$I(X^n \to Y^n \| Z^n) = E_{p(X^n,Y^n,Z^n)} \log \frac{\vec{p}(Y^n|X^n, Z^n)}{\vec{p}(Y^n|Z^n)}$$

and the causal conditional directed information density is defined by

$$i(X^n \to Y^n \| Z^n) = \log \frac{\vec{p}(Y^n|X^n, Z^n)}{\vec{p}(Y^n|Z^n)}$$

where

$$\vec{p}(y^n|x^n, z^n) = \prod_{i=1}^{n} p(y_i|x^i, y^{i-1}, z^{i-1}) \quad \text{and} \quad \vec{p}(y^n|z^n) = \prod_{i=1}^{n} p(y_i|y^{i-1}, z^{i-1}).$$

It is easy to verify that

$$I(X^n \to Y^n \| Z^n) = \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1}, Z^{i-1})$$

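Remark 1: If the Markov chains $Z_i^n - (Y^{i-1}, Z^{i-1}) - Y_i$ and $Z_i^n - (X^i, Y^{i-1}, Z^{i-1}) - Y_i$ hold, where $Z_i^n = (Z_i, \ldots, Z_n)$, we may obtain

$$
\begin{aligned}
I(X^n \to Y^n|Z^n) &= \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1}, Z^n) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}, Z^n) - H(Y_i|X^i, Y^{i-1}, Z^n) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}, Z^{i-1}) - H(Y_i|X^i, Y^{i-1}, Z^{i-1}) \\
&= \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1}, Z^{i-1}) \\
&= I(X^n \to Y^n \| Z^n)
\end{aligned}
$$

That is, the "causal conditioning" and the "normal conditioning" coincide.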



III. Residual Directed Information and Information Flow 

In this section, we first introduce the setup considered in the paper and give a high-level discussion on the failure 
of using either mutual information or directed information as a measure of the effective information flow through 
the channel. Then we define a new measure, named residual directed information, and derive its properties. Finally, 
we analyze the information flow in the noisy feedback channel. 



A. Noisy Feedback and Causality 

According to Fig. 1, we model the channel at time $i$ as $p(y_i|x^i, y^{i-1})$. The channel output (without any encoding)
is fed back to the encoder through a noisy link, which is modeled as $p(z_i|y^i, z^{i-1})$. At time $i$, the deterministic
encoder takes the message $W$ and the past outputs $Z_1, Z_2, \ldots, Z_{i-1}$ of the feedback link, and then produces a
channel input $X_i$. Note that the encoder has access to the output of the feedback link with one time-step delay. At
time $n$, the decoder takes all the channel outputs $Y_1, Y_2, \ldots, Y_n$ and then produces the decoded message $\hat{W}$. We
present the time ordering of these random variables below.

$$W, X_1, Y_1, Z_1, X_2, Y_2, Z_2, \ldots, X_{n-1}, Y_{n-1}, Z_{n-1}, X_n, Y_n, \hat{W}$$

Fig. 1: Channels with noisy feedback. (Block diagram: the message $W$ enters the Encoder, which produces $X_i$; the Channel $p(y_i|y^{i-1}, x^i)$ outputs $Y_i$ to the Decoder, which produces $\hat{W}$; $Y_i$ also passes through the Feedback Link $p(z_i|z^{i-1}, y^i)$ and a 1-Step Delay back to the Encoder.)

Note that all initial conditions (e.g. channel, feedback link, channel input, etc.) are assumed to be
known a priori by both the encoder and the decoder. Before entering the more technical part of this paper, it is
necessary to give a specific definition of "noisy feedback".

Definition 6: (Noisy Feedback Link) The feedback link is noisy if for some time instant $i$ there exists no function
$g_i$ such that

$$g_i(X^i, Z^i, W) = Y^i. \qquad (3)$$

The feedback link is noiseless if it is not noisy.

Remark 2: This definition states that, for noisy feedback links, not all the channel outputs can be exactly recovered
at the encoder side and, therefore, the encoder and decoder lose mutual understanding. In other words, at some time instant
$i$ the encoder cannot access the past channel outputs $Y^i$ through the information $(X^i, Z^i, W)$ to produce the channel
input $X_{i+1}$. We use "perfect (ideal) feedback" to refer to the case of $Z_i = Y_i$ for all time instants $i$. Essentially, noiseless
feedback is equivalent to perfect feedback since, in both cases, the encoder can access the channel outputs without
any error.

Example 1: Consider the feedback link $Z_i = Y_i + V_i$ where $V_i$ denotes additive noise at time instant $i$. If
the channel outputs $Y_i$ only take values in a set of integers (i.e. $\pm 1, \pm 2, \ldots$) and $V_i$ only takes values in $\{\pm 0.2, \pm 0.4\}$,
then obviously the channel outputs can be exactly recovered at the encoder side. Thus, this feedback link is noiseless
even though it is imperfect.
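The recovery in Example 1 can be sketched as follows. One possible recovery function (our assumption; the paper does not specify one) simply rounds $Z_i$ to the nearest integer, which works because the noise magnitude never reaches 0.5. The finite range of outputs tested below is also an assumption made only to keep the check small.

```python
# Noise alphabet from Example 1.
NOISE_VALUES = (-0.4, -0.2, 0.2, 0.4)

def recover_output(z):
    """Hypothetical recovery function g: round the feedback output to the nearest integer."""
    return round(z)

# Check every (output, noise) pair for integer outputs in an assumed finite range.
pairs = [(y, v) for y in range(-5, 6) if y != 0 for v in NOISE_VALUES]
all_recovered = all(recover_output(y + v) == y for y, v in pairs)

print("channel output always recovered:", all_recovered)
```

Since a function of $Z_i$ alone recovers $Y_i$ exactly, a function $g_i$ as in (3) exists at every time instant, and the link is noiseless in the sense of Definition 6.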

Next, we give a definition of typical noisy feedback link which will be studied in the next section. 

Definition 7: (Typical Noisy Feedback Link) Given a channel $\{p(y_i|x^i, y^{i-1})\}_{i=1}^{\infty}$, the noisy feedback link $\{p(z_i|y^i, z^{i-1})\}_{i=1}^{\infty}$
is typical if it satisfies

$$\liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Z^{i-1}|Y^{i-1}) > 0 \qquad (4)$$

for any channel input distribution $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^{\infty}$. The noisy feedback link is non-typical if it is not typical.



Remark 3: This definition implies that the noise in the feedback link must be active consistently over time (e.g. 
not physically vanishing). In practice, the typical noisy feedback link is the most interesting case for study. 

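Example 2: Consider a binary symmetric feedback link modeled as $Z_i = Y_i \oplus V_i$ where the noise $V_i$ is i.i.d. and
takes values in $\{0, 1\}$ with equal probability. Then we have

$$
\begin{aligned}
\liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Z^{i-1}|Y^{i-1}) &= \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(V^{i-1}|Y^{i-1}) \\
&\ge \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(V_{i-1}|Y^{i-1}) \\
&\overset{(a)}{=} \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(V_{i-1}) \\
&= \liminf_{n \to \infty} \frac{n-1}{n} \\
&= 1
\end{aligned}
$$

where (a) follows from the fact that $Y^{i-1}$ is independent of $V_{i-1}$ due to the one-step delay. Therefore, this noisy feedback
link is typical.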

We summarize the family of feedback links in Fig. 2. We next define the achievable rate and capacity for
channels with noisy feedback.

In the sequel, the term "noisy feedback" refers to "typical noisy feedback" unless otherwise specified.
Definition 8: (Channel Code) Consider a message $W$ which is drawn from an index set $\{1, 2, \ldots, M\}$ and a noisy
feedback communication channel $(\mathcal{X}^n, \{p(y_i|x^i, y^{i-1})\}_{i=1}^{n}, \mathcal{Y}^n, \{p(z_i|y^i, z^{i-1})\}_{i=1}^{n}, \mathcal{Z}^n)$ with the interpretation that
$X_i$ is the input of the channel, $Y_i$ is the output of the channel and the input of the feedback link, and $Z_i$ is the output of the noisy feedback link
at time instant $i$ ($1 \le i \le n$). Then an $(M, n)$ channel code consists of an index set $\{1, 2, \ldots, M\}$, an encoding
function $\{1, 2, \ldots, M\} \times \mathcal{Z}^{n-1} \to \mathcal{X}^n$, and a decoding function $\mathcal{Y}^n \to \{1, 2, \ldots, M\}$, where the decoding function
is a deterministic rule that assigns a guess to each possible received vector.

Definition 9: (Achievable Rate) The rate $R$ of an $(M, n)$ code is

$$R = \frac{\log M}{n} \quad \text{bits per channel use}$$

The rate is said to be achievable if there exists a sequence of $(2^{nR}, n)$ codes (see footnote 2) such that the maximal probability of
error tends to zero as $n \to \infty$.

Definition 10: (Channel Capacity) The capacity of a channel with noisy feedback is the supremum of all 
achievable rates. 

When there is no feedback from the channel output to the encoder, the maximum of mutual information (i.e.
$\max_{p(x^n)} I(X^n; Y^n)$) characterizes the maximum information flow through the channel with arbitrarily small
probability of decoding error. This quantity is defined as the capacity of the channel. When there is noiseless
feedback, taking the supremum of the directed information $I(X^n \to Y^n)$ over $\vec{p}(x^n|y^n)$ gives us the feedback capacity [24],
[10]. When there is noisy feedback, the appropriate measure/characterization of the effective information flow
through the channel has been unknown until now. In the next section, we provide the missing measure.

B. Residual Directed Information 

Based on the "(causal conditional) directed information", the residual directed information and its density with 
respect to message W is defined as follows. 

Definition 11: (Residual Directed Information and Its Density)

$$I^R(X^n(W) \to Y^n) = I(X^n \to Y^n) - I(X^n \to Y^n \| W). \qquad (5)$$

Equivalently,

$$I^R(X^n(W) \to Y^n) = I(X^n \to Y^n) - I(X^n \to Y^n|W). \qquad (6)$$

The residual directed information density is defined as

$$i^R(X^n(W) \to Y^n) = i(X^n \to Y^n) - i(X^n \to Y^n|W).$$

Footnote 2: With a slight abuse of notation, we write $nR$ in place of its integer rounding for convenience.

Fig. 2: Family of feedback links in communication systems. The "typical noisy feedback" is the case in which we are interested.

The following theorem shows that the residual directed information captures the mutual information between the
message and the channel outputs, which we refer to as the effective information flow.

Theorem 1: If $X^n$ and $Y^n$ are the inputs and outputs, respectively, of a discrete channel with noisy feedback,
as shown in Fig. 1, then

$$I(W; Y^n) = I^R(X^n(W) \to Y^n) = I(X^n \to Y^n) - I(X^n \to Y^n|W).$$
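Proof:

$$
\begin{aligned}
I(W; Y^n) &= H(Y^n) - H(Y^n|W) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, X^i) - \Big( \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, X^i) \Big) \\
&\overset{(a)}{=} \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, X^i) - \Big( \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, X^i) \Big) \\
&= \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1}) - \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1}, W) \\
&= I(X^n \to Y^n) - I(X^n \to Y^n|W) \\
&\overset{(b)}{=} I^R(X^n(W) \to Y^n)
\end{aligned}
$$

where (a) follows from the Markov chain $W - (X^i, Y^{i-1}) - Y_i$. Line (b) follows from the definition of residual
directed information.

■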
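Theorem 1 can be checked numerically on a small example. The sketch below is only an illustration under assumptions of our own choosing: $n = 2$, a one-bit uniform message, a binary symmetric forward channel (crossover 0.1), a binary symmetric feedback link (crossover 0.2), and a hypothetical deterministic encoder with $X_1 = W$ and $X_2 = W \oplus Z_1$. It computes both sides of the identity from the joint distribution.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(p, idx):
    """Joint entropy (bits) of the coordinates idx of a joint pmf p: tuple -> prob."""
    marg = defaultdict(float)
    for outcome, prob in p.items():
        marg[tuple(outcome[i] for i in idx)] += prob
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def cond_mi(p, a, b, c=()):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    a, b, c = list(a), list(b), list(c)
    return (entropy(p, a + c) + entropy(p, b + c)
            - entropy(p, a + b + c) - entropy(p, c))

eps, delta = 0.1, 0.2   # assumed forward and feedback crossover probabilities
joint = defaultdict(float)
# Coordinates ordered as (W, X1, Y1, Z1, X2, Y2).
for w, n1, v1, n2 in product((0, 1), repeat=4):
    x1 = w                # hypothetical encoder, time 1
    y1 = x1 ^ n1          # forward channel
    z1 = y1 ^ v1          # noisy feedback link
    x2 = w ^ z1           # hypothetical encoder, time 2 (uses delayed feedback)
    y2 = x2 ^ n2          # forward channel
    prob = (0.5 * (eps if n1 else 1 - eps)
            * (delta if v1 else 1 - delta) * (eps if n2 else 1 - eps))
    joint[(w, x1, y1, z1, x2, y2)] += prob

W, X1, Y1, Z1, X2, Y2 = range(6)
effective = cond_mi(joint, [W], [Y1, Y2])                        # I(W; Y^2)
directed = (cond_mi(joint, [X1], [Y1])
            + cond_mi(joint, [X1, X2], [Y2], [Y1]))              # I(X^2 -> Y^2)
directed_given_w = (cond_mi(joint, [X1], [Y1], [W])
                    + cond_mi(joint, [X1, X2], [Y2], [Y1, W]))   # I(X^2 -> Y^2 | W)
residual = directed - directed_given_w

print(f"I(W;Y^2) = {effective:.4f}, residual directed information = {residual:.4f}")
print(f"I(X^2 -> Y^2) = {directed:.4f}, I(X^2 -> Y^2 | W) = {directed_given_w:.4f}")
```

In this toy system the conditional term $I(X^2 \to Y^2|W)$ is strictly positive because the feedback noise enters $X_2$, so the directed information overstates $I(W; Y^2)$ while the residual directed information matches it.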

Remark 4: This theorem implies that, for noisy feedback channels, the directed information $I(X^n \to Y^n)$
captures both the effective information flow (i.e. $I(W; Y^n)$) generated by the message and the redundant information
flow (i.e. $I(X^n \to Y^n|W)$) generated by the feedback noise (dummy message). Since only $I(W; Y^n)$ is the relevant
quantity for channel capacity, the well-known directed information clearly fails to characterize the noisy feedback
capacity.

In the following corollary, we explore some properties of the residual directed information.

Corollary 1: The residual directed information $I^R(X^n(W) \to Y^n)$ satisfies the following properties:

1) $I^R(X^n(W) \to Y^n) \ge 0$ (with equality if and only if the message $W$ and the channel outputs $Y^n$ are
independent).

2) $I^R(X^n(W) \to Y^n) \le I(X^n \to Y^n) \le I(X^n; Y^n)$.

The first equality holds if the feedback is perfect. The second equality holds if there is no feedback.

Proof: 1). This follows from Theorem 1: $I^R(X^n(W) \to Y^n) = I(W; Y^n) \ge 0$. The necessary and sufficient
condition for $I^R(X^n(W) \to Y^n) = 0$ is obvious by looking at $I(W; Y^n)$.

2). Since $I(X^n \to Y^n|W) = \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1}, W) \ge 0$ (equality holds for the perfect feedback case),

$$I^R(X^n(W) \to Y^n) = I(X^n \to Y^n) - I(X^n \to Y^n|W) \le I(X^n \to Y^n)$$

The proof of the second inequality $I(X^n \to Y^n) \le I(X^n; Y^n)$ is presented in [6].

■

Fig. 3: Channels with additive noise feedback



C. Information Flow in Noisy Feedback Channels 

To gain more insight into the information flow of noisy feedback channels, we apply the new concept to channels
with additive noise feedback and analyze its information flow; see Fig. 3. We present the time ordering of these
random variables below (see footnote 3).

$$W, X_1, Y_1, V_1, X_2, Y_2, V_2, \ldots, X_{n-1}, Y_{n-1}, V_{n-1}, X_n, Y_n, \hat{W}$$

Corollary 2: If $X^n$ and $Y^n$ are the inputs and outputs, respectively, of a discrete channel with additive noise
feedback, as shown in Fig. 3, then

$$I(X^n \to Y^n) = I(W; Y^n) + I(V^{n-1}; Y^n) + I(W; V^{n-1}|Y^n)$$

Footnote 3: $Z_i$ is not shown in the time ordering since we have $Z_i = Y_i + V_i$.
Proof: We herein adopt a derivation methodology similar to the one in Theorem 1.

$$
\begin{aligned}
I(W; Y^n) &= H(Y^n) - H(Y^n|W) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, V^{i-1}) - \Big( \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, V^{i-1}) \Big) \\
&\overset{(a)}{=} \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, Z^{i-1}) - \Big( \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, V^{i-1}) \Big) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, X^i) - \Big( \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W) - \sum_{i=1}^{n} H(Y_i|Y^{i-1}, W, V^{i-1}) \Big) \\
&= \sum_{i=1}^{n} I(X^i; Y_i|Y^{i-1}) - \sum_{i=1}^{n} I(V^{i-1}; Y_i|Y^{i-1}, W) \\
&= I(X^n \to Y^n) - I(V^{n-1} \to Y^n|W)
\end{aligned}
$$

where (a) follows from the fact that $Z^{i-1} = Y^{i-1} + V^{i-1}$. The equality after (a) uses that $X^i$ is a deterministic function of $(W, Z^{i-1})$ together with the Markov chain $(W, Z^{i-1}) - (X^i, Y^{i-1}) - Y_i$. Next,

$$
\begin{aligned}
I(V^{n-1} \to Y^n|W) &\overset{(b)}{=} I(V^{n-1}; Y^n|W) \\
&= H(V^{n-1}|W) - H(V^{n-1}|Y^n, W) \\
&\overset{(c)}{=} H(V^{n-1}) - H(V^{n-1}|Y^n) + H(V^{n-1}|Y^n) - H(V^{n-1}|Y^n, W) \\
&= I(V^{n-1}; Y^n) + I(W; V^{n-1}|Y^n)
\end{aligned}
$$

where (b) follows from the fact that there exists no feedback from $Y^n$ to $V^{n-1}$ and (c) follows from the fact that
the noise $V^{n-1}$ is independent of $W$. Putting the previous equations together, the proof is complete. ■
Corollary 2 allows us to explicitly interpret the information flow on a dependency graph (e.g. $n = 3$); see Fig. 4.
The solid lines from the message $W$ to the sequence $X^3$ represent the dependence of $X^3$ on $W$. The dotted lines from the
additive noise $V^2$ to the sequence $X^3$ represent the dependence of $X^3$ on $V^2$. The dependence of the channel inputs $X^3$
on the channel outputs $Y^2$ is not shown in the graph since the directed information only captures the information
flow from $X^3$ to $Y^3$ [6]. As shown in the zoomed circle, the directed information flow from $X^3$ to $Y^3$ (through the
cut A-E) implicitly contains three sub-information flows, wherein the mutual informations $I(W; Y^3)$ and $I(V^2; Y^3)$
measure the message-transmitting and the noise-transmitting information flows, respectively. The feedback noise $V^2$
is treated as a dummy message which also needs to be recovered by the decoder. The conditional mutual information
$I(W; V^2|Y^3)$ quantifies the mixed information flow between the message-transmitting and noise-transmitting flows.
Essentially, the second term in the residual directed information (i.e. $I(X^n \to Y^n|W)$) precisely captures the
non-message-transmitting information flows (i.e. $I(V^{n-1}; Y^n)$ and $I(W; V^{n-1}|Y^n)$). Therefore, the residual directed
information should be a proper measure to work with for channels with noisy feedback.

Fig. 4: The information flow of channels with additive noise feedback

Understanding the information flow in noisy feedback channels lets us investigate the noisy
feedback problem at a higher level and serves as the basis for the results developed later.

IV. Discrete Memoryless Channel With Noisy Feedback

With the new concept and the picture of the information flow in hand, we now concentrate on DMC with noisy 
feedback. We show a negative yet fundamental result that the capacity is not achievable by using any non-trivial 
closed-loop encoder. In other words, exploiting the information from the feedback link is actually detrimental to 
achieving the maximal achievable rate. We first give some necessary definitions below. 

A. Discrete Memoryless Channel and Typical Closed-Loop Encoder 

Definition 12: (Discrete Memoryless Channel) A discrete memoryless channel is a discrete channel satisfying

$$p(y_i|x^i, y^{i-1}) = p(y_i|x_i)$$

Definition 13: (Typical Closed-Loop Encoder) Given a channel $\{p(y_i|x^i, y^{i-1})\}_{i=1}^{\infty}$ and a noisy feedback link
$\{p(z_i|y^i, z^{i-1})\}_{i=1}^{\infty}$, an encoder is defined as a typical closed-loop encoder if it satisfies

$$\liminf_{n \to \infty} \frac{1}{n} I(Z^{n-1} \to Y^n) > 0.$$

For the additive noise feedback case as shown in Fig. 3, the condition is equivalent to

$$\liminf_{n \to \infty} \frac{1}{n} I(V^{n-1}; Y^n) > 0.$$

Remark 5: The equivalence is straightforward to check. That is,

$$
\begin{aligned}
\liminf_{n \to \infty} \frac{1}{n} I(Z^{n-1} \to Y^n) &= \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - H(Y_i|Y^{i-1}, Z^{i-1}) \\
&= \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - H(Y_i|Y^{i-1}, V^{i-1}) \\
&= \liminf_{n \to \infty} \frac{1}{n} I(V^{n-1} \to Y^n) \\
&\overset{(a)}{=} \liminf_{n \to \infty} \frac{1}{n} I(V^{n-1}; Y^n)
\end{aligned}
$$

where (a) follows from the fact that there is no feedback from $Y$ to $V$ and thus the mutual information and the directed
information coincide.



Remark 6: This definition implies that a typical closed-loop encoder should non-trivially take the feedback
information $Z^n$ to produce channel inputs $X^n$ over time. It is easy to verify that an encoder is non-typical if it discards
all feedback information (i.e. open-loop encoder) or only extracts feedback information for finitely many time instants.

Remark 7: The typical closed-loop encoder is only well-defined under the assumption of typical noisy feedback
(Definition 7). Otherwise, for any encoder, we have

$$
\begin{aligned}
\liminf_{n \to \infty} \frac{1}{n} I(Z^{n-1} \to Y^n) &= \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(Z^{i-1}; Y_i|Y^{i-1}) \\
&= \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Z^{i-1}|Y^{i-1}) - H(Z^{i-1}|Y^{i}) \\
&\le \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Z^{i-1}|Y^{i-1}) \\
&= 0.
\end{aligned}
$$



Now, we present the main theorem of this section.

Theorem 2: The capacity $C_{FB}^{noise}$ of a discrete memoryless channel with noisy feedback equals the non-feedback
capacity $C$. The capacity $C_{FB}^{noise}$ is not achievable by implementing any typical closed-loop encoder. Alternatively,
any capacity-achieving encoder is non-typical. Furthermore, the rate-loss by implementing a typical closed-loop
encoder is lower bounded by $\liminf_{n \to \infty} \frac{1}{n} I(Z^{n-1} \to Y^n)$ (see footnote 4).

Remark 8: This negative result implies that it is impossible to find a capacity-achieving feedback coding scheme
for a DMC with noisy feedback, whereas it is possible in the perfect feedback case (e.g. the Schalkwijk-Kailath scheme).
For example, [17] has proposed a linear coding scheme for the AWGN channel with bounded feedback noise and [18]
has proposed a concatenated coding scheme for the AWGN channel with noisy feedback. It is easy to check that both
of these closed-loop encoders are typical and therefore both coding schemes cannot achieve the capacity unless, as
discussed in [17], [18], the feedback additive noise is shrinking to zero (i.e. non-typical noisy feedback).

Remark 9: Theorem 2 indicates that noisy feedback is unfavorable in the sense of achievable rate. However,
using noisy feedback still provides many benefits as mentioned in the Introduction. Furthermore, from a
control-theoretic point of view, (noisy) feedback is necessary for stabilizing unstable plants and achieving certain performance.
Therefore, we need a tradeoff while using noisy feedback.

Before moving to prove the main theorem, we need the following lemma.

Lemma 1: For any typical closed-loop encoder,

$$\liminf_{n \to \infty} \frac{1}{n} I(X^n \to Y^n|W) > 0.$$

Proof: For any $1 \le i \le n$, we have

$$
\begin{aligned}
I(W; Z_i|Y^i, Z^{i-1}) &= H(Z_i|Y^i, Z^{i-1}) - H(Z_i|Y^i, Z^{i-1}, W) \\
&= H(Z_i|Y^i, Z^{i-1}) - H(Z_i|Y^i, Z^{i-1}) \\
&= 0.
\end{aligned}
$$

Then,

$$
\begin{aligned}
I(W; (Y^n, Z^{n-1})) &= I(W; (Y^n, Z^n)) - I(W; Z_n|Y^n, Z^{n-1}) \\
&= \sum_{i=1}^{n} I(W; (Y_i, Z_i)|Y^{i-1}, Z^{i-1}) - I(W; Z_n|Y^n, Z^{n-1}) \\
&= \sum_{i=1}^{n} \big[ I(W; Y_i|Y^{i-1}, Z^{i-1}) + I(W; Z_i|Y^i, Z^{i-1}) \big] - I(W; Z_n|Y^n, Z^{n-1}) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}, Z^{i-1}) - H(Y_i|Y^{i-1}, Z^{i-1}, W) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}, Z^{i-1}) - H(Y_i|Y^{i-1}, Z^{i-1}, W, X^i) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}, Z^{i-1}) - H(Y_i|Y^{i-1}, X^i)
\end{aligned}
$$

We investigate another equality as follows.

$$
\begin{aligned}
I(X^n \to Y^n) - I(Z^{n-1} \to Y^n) &= \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - H(Y_i|Y^{i-1}, X^i) - H(Y_i|Y^{i-1}) + H(Y_i|Y^{i-1}, Z^{i-1}) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}, Z^{i-1}) - H(Y_i|Y^{i-1}, X^i)
\end{aligned}
$$

Combining the above equalities, we have

$$
\begin{aligned}
I(Z^{n-1} \to Y^n) &= I(X^n \to Y^n) - I(W; (Y^n, Z^{n-1})) \\
&\overset{(a)}{=} I(W; Y^n) + I(X^n \to Y^n|W) - I(W; (Y^n, Z^{n-1})) \\
&= I(X^n \to Y^n|W) - I(W; Z^{n-1}|Y^n)
\end{aligned}
$$

where (a) follows from Theorem 1. Since $I(W; Z^{n-1}|Y^n) \ge 0$, we have $I(X^n \to Y^n|W) \ge I(Z^{n-1} \to Y^n)$. According to the definition of typical closed-loop encoder, the proof is complete.

■

Footnote 4: The "rate-loss" refers to the gap between the capacity $C$ and the achievable rate $R$. Given a channel $\{p(y_i|x^i, y^{i-1})\}_{i=1}^{\infty}$ and a noisy
feedback link $\{p(z_i|y^i, z^{i-1})\}_{i=1}^{\infty}$, the value of $I(Z^{n-1} \to Y^n)$ only depends on the channel input distributions $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^{\infty}$
induced by the implemented encoder.

Now we are ready to prove Theorem 2.

Proof: Firstly, we prove that

$$C_{FB}^{noise} = C = \max_{p(x)} I(X; Y)$$

Since a non-feedback channel code is a special case of a noisy feedback channel code, any rate that can be
achieved without feedback can be achieved with noisy feedback. Therefore, we have $C_{FB}^{noise} \ge C$. Given a noisy
feedback link, we clearly have $C_{FB}^{noise} \le C_{FB}$, where $C_{FB}$ is the capacity of channels with perfect feedback. As
$C = C_{FB}$ for DMC [25], we have $C_{FB}^{noise} = C = \max_{p(x)} I(X; Y)$.

Next, we show that for any typical closed-loop encoder, the achievable rates $R$ are strictly less than $C$ and the
difference is lower bounded by $\liminf_{n \to \infty} \frac{1}{n} I(Z^{n-1} \to Y^n)$. Let $W$ be uniformly distributed over $\{1, 2, \ldots, 2^{nR}\}$
and $P_e^{(n)} = \Pr(W \ne \hat{W})$ with $P_e^{(n)} \to 0$ as $n \to \infty$. Then

$$
\begin{aligned}
nR &= H(W) \\
&= H(W|\hat{W}) + I(W; \hat{W}) \\
&\overset{(a)}{\le} 1 + P_e^{(n)} nR + I(W; \hat{W}) \\
&\overset{(b)}{\le} 1 + P_e^{(n)} nR + I(W; Y^n)
\end{aligned}
$$

where (a) and (b) follow from Fano's inequality and the data-processing inequality, respectively.

Next,

$$
\begin{aligned}
I(W; Y^n) &= I^R(X^n(W) \to Y^n) \\
&= I(X^n \to Y^n) - I(X^n \to Y^n|W) \\
&= \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - \sum_{i=1}^{n} H(Y_i|X^i, Y^{i-1}) - I(X^n \to Y^n|W) \\
&\overset{(c)}{=} \sum_{i=1}^{n} H(Y_i|Y^{i-1}) - \sum_{i=1}^{n} H(Y_i|X_i) - I(X^n \to Y^n|W) \\
&\overset{(d)}{\le} \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i|X_i) - I(X^n \to Y^n|W) \\
&= \sum_{i=1}^{n} I(X_i; Y_i) - I(X^n \to Y^n|W) \\
&\le nC - I(X^n \to Y^n|W)
\end{aligned}
$$

where (c) follows from the definition of DMC and (d) follows from the fact that removing conditioning increases
entropy.

Putting these together, we have

$$R \le \frac{1}{n} + P_e^{(n)} R + C - \frac{1}{n} I(X^n \to Y^n|W)$$

Since this holds for every $n$ and $\frac{1}{n} + P_e^{(n)} R \to 0$,

$$
\begin{aligned}
R &\le \liminf_{n \to \infty} \Big\{ \frac{1}{n} + P_e^{(n)} R + C - \frac{1}{n} I(X^n \to Y^n|W) \Big\} \\
&= C - \limsup_{n \to \infty} \frac{1}{n} I(X^n \to Y^n|W) \\
&\le C - \liminf_{n \to \infty} \frac{1}{n} I(X^n \to Y^n|W)
\end{aligned}
$$

According to the proof of Lemma 1, we have

$$
\begin{aligned}
R &\le C - \liminf_{n \to \infty} \frac{1}{n} \big( I(Z^{n-1} \to Y^n) + I(W; Z^{n-1}|Y^n) \big) \\
&\le C - \liminf_{n \to \infty} \frac{1}{n} I(Z^{n-1} \to Y^n)
\end{aligned}
$$

By the definition of typical closed-loop encoder, the proof is complete. ■

Fig. 5: Binary codeword erasure channel/feedback
B. Example 

We give an example of communication through a DMC with typical noisy feedback, which offers insight into how feedback "noise" reduces the effective transmission rate and how signaling helps rebuild the coordination between the transmitter and the receiver. Consider a binary codeword erasure channel (BCEC) with noisy feedback, as shown in Fig. 5. The channel input is an $m$-bit codeword. The input codeword is reliably transmitted with probability $1-\alpha$ and erased with probability $\alpha$. Similarly, we assume a noisy feedback link with erasure probability $p$. The capacity of this channel is $C_{FB}^{noise} = m(1-\alpha)$. One simple but suboptimal encoding strategy is the following: use the first bit of every $m$-bit codeword as a signaling bit (i.e. 1 marks a retransmitted $m$-bit codeword while 0 marks a new one). If the output of the feedback link is $e$, the encoder retransmits the previous codeword with signaling bit "1"; otherwise, it transmits the next codeword with signaling bit "0". Under this strategy, the decoder can recover the message with arbitrarily small error thanks to the signaling bit. Next, we analyze the achievable rate of this strategy. Assume that $n$ bits need to be transmitted and that $n$ is sufficiently large. Then $\alpha n$ bits will be lost and $(1-\alpha)n$ bits will get through reliably. Due to the noisy feedback, the encoder will retransmit $b_1 = \alpha n + p(1-\alpha)n$ bits. Similarly, $\alpha b_1$ bits will be lost and $(1-\alpha)b_1$ bits will get through. Then the encoder will retransmit $b_2 = \alpha b_1 + p(1-\alpha)b_1$ bits. After $t$ rounds of retransmission with $t \to \infty$, the achievable transmission rate $R$ is

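$$
\begin{aligned}
R &= \lim_{t\to\infty} \frac{n(m-1)/m}{\frac{1}{m}(n + b_1 + b_2 + \cdots + b_t)} \\
&= \lim_{t\to\infty} \frac{n(m-1)}{n + \dfrac{(\alpha n + p(1-\alpha)n)\left(1 - (\alpha + p(1-\alpha))^t\right)}{1 - (\alpha + p(1-\alpha))}} \\
&= \frac{n(m-1)}{n + \dfrac{n(\alpha + p(1-\alpha))}{1 - (\alpha + p(1-\alpha))}} \\
&= (m-1)(1-p)(1-\alpha).
\end{aligned}
$$

Then we have

$$
\frac{R}{C_{FB}^{noise}} = (1-p)\left(1 - \frac{1}{m}\right).
$$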

This shows that the loss of transmission rate is caused by two factors: uncertainty in the feedback link and signaling in the forward channel. If $p = 0$ (i.e. perfect feedback) and $m \to \infty$ (i.e. the signaling bit can be neglected), we have $R = C_{FB}^{noise}$. Notably, in this example the relative loss of effective transmission rate, $1 - R/C_{FB}^{noise}$, is independent of the noise in the forward channel.
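The rate computation for the signaling-bit strategy can be checked numerically. The following is a minimal sketch, not part of the original analysis: it sums the retransmission batches $b_1, b_2, \ldots$ directly and compares the result with the closed form $(m-1)(1-p)(1-\alpha)$. The function names and the number of rounds are illustrative assumptions.

```python
def bcec_rate(m, alpha, p, rounds=500):
    """Rate (information bits per channel use) of the signaling-bit strategy.

    Follows the text: starting from n bits, a fraction
    q = alpha + p * (1 - alpha) of every batch is retransmitted in the
    next round (b_1 = q * n, b_2 = q * b_1, ...).
    """
    n = 1.0                        # batch size; it cancels in the final ratio
    q = alpha + p * (1 - alpha)    # forward erasure, or delivery with erased feedback
    total_bits, batch = n, n
    for _ in range(rounds):        # truncated version of the limit t -> infinity
        batch *= q
        total_bits += batch
    info_bits = n * (m - 1) / m    # one bit per codeword is spent on signaling
    channel_uses = total_bits / m  # each channel use carries one m-bit codeword
    return info_bits / channel_uses


def bcec_rate_closed_form(m, alpha, p):
    """Closed form from the text: R = (m - 1)(1 - p)(1 - alpha)."""
    return (m - 1) * (1 - p) * (1 - alpha)


if __name__ == "__main__":
    m, alpha, p = 8, 0.2, 0.1      # illustrative parameter values
    rate = bcec_rate(m, alpha, p)
    capacity = m * (1 - alpha)
    print("summed rate   :", rate)
    print("closed form   :", bcec_rate_closed_form(m, alpha, p))
    print("rate/capacity :", rate / capacity)
```

Changing `alpha` while keeping `m` and `p` fixed leaves the printed ratio unchanged, which matches the observation that the relative loss does not depend on the forward-channel noise.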

V. A Channel Coding Theorem and Computable Bounds on the Capacity 

In this section, we first show that the residual directed information can be used to characterize the capacity of finite-alphabet channels with noisy feedback. As we will discuss, this characterization has useful features and provides insight into the noisy feedback capacity. However, computing this characterization is in general intractable. We then propose computable bounds characterized by the causal conditional directed information.

We first formulate the channel coding problem. Here, we require the use of code-functions as opposed to codewords, as shown in Fig. 6. Briefly, at time 0, we choose a message from a message set $\mathcal{W}$. This message is associated with a sequence of code-functions. Then, from time 1 to $n$, we use the channel to transmit information sequentially based on the corresponding code-function. At time $n + 1$, we decode the message as $\hat{W}$. We now give a formal definition of this communication scheme, which extends the description presented in [9].
Fig. 6: Channels with noisy feedback (a code-function representation)



Definition 14: (Communication Scheme for Channels with Noisy Feedback: A Code-function Representation)

1. A message set is a set $\mathcal{W} = \{1, 2, \cdots, M\}$.

2. A channel code-function is a sequence of $n$ deterministic measurable maps $f^n = (f_1, f_2, \cdots, f_n)$ such that $f_i : \mathcal{Z}^{i-1} \to \mathcal{X}$, which takes $z^{i-1} \mapsto x_i$.

3. A channel encoder is a set of $M$ channel code-functions, denoted by $\{f^n[w]\}_{w=1}^M$.

4. A channel is a family of conditional probabilities $\{p(y_i|x^i, y^{i-1})\}_{i=1}^n$.

5. A noisy feedback link is a family of conditional probabilities $\{p(z_i|y^i, z^{i-1})\}_{i=1}^n$.

6. A channel decoder is a map $g : \mathcal{Y}^n \to \mathcal{W}$, which takes $y^n \mapsto w$.
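To make the code-function representation concrete, the sketch below simulates the loop $x_i = f_i(z^{i-1})$, $y_i \sim$ channel, $z_i \sim$ feedback link for binary alphabets. It is only an illustration under simplifying assumptions: both links are memoryless binary symmetric links (the definition allows dependence on the full history), and the code-function `xor_with_last_feedback` is a hypothetical example, not a scheme proposed in the paper.

```python
import random


def binary_symmetric(flip_prob):
    """Memoryless binary link that flips its input with probability flip_prob."""
    def use(bit, rng):
        return bit ^ int(rng.random() < flip_prob)
    return use


def run_scheme(code_function, n, channel, feedback_link, rng):
    """Simulate n channel uses of the scheme in Definition 14.

    code_function(i, past_feedback) plays the role of f_i: it maps the
    feedback outputs z^{i-1} observed so far to the next channel input x_i.
    """
    xs, ys, zs = [], [], []
    for i in range(1, n + 1):
        x = code_function(i, tuple(zs))  # x_i = f_i(z^{i-1})
        y = channel(x, rng)              # forward channel output y_i
        z = feedback_link(y, rng)        # noisy feedback output z_i
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def xor_with_last_feedback(message_bit):
    """Hypothetical code-function f^n[w]: send w XOR the last feedback symbol."""
    def f(i, past_feedback):
        last = past_feedback[-1] if past_feedback else 0
        return message_bit ^ last
    return f


def open_loop(codeword):
    """An ordinary codeword viewed as a code-function that ignores feedback."""
    def f(i, past_feedback):
        return codeword[i - 1]
    return f


if __name__ == "__main__":
    rng = random.Random(1)
    channel = binary_symmetric(0.1)
    feedback_link = binary_symmetric(0.2)
    print(run_scheme(xor_with_last_feedback(1), 6, channel, feedback_link, rng))
    print(run_scheme(open_loop([1, 0, 1, 1, 0, 0]), 6, channel, feedback_link, rng))
```

The `open_loop` helper shows why codewords are a special case: a codeword is a code-function whose maps are constant in the feedback argument.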



Based on the above communication scheme, we redefine the channel code and the $\epsilon$-achievable rate in terms of code-functions.

Definition 15: (Channel Code) An $(n, M, \epsilon)$ channel code over time horizon $n$ consists of $M$ code-functions $\{f^n[w]\}_{w=1}^M$, a channel decoder $g$, and an error probability satisfying

$$
\frac{1}{M} \sum_{w=1}^{M} p(w \neq g(y^n)\,|\,w) \le \epsilon.
$$

Definition 16: ($\epsilon$-achievable Rate) $R \ge 0$ is an $\epsilon$-achievable rate if, for every $\epsilon > 0$, there exists, for all sufficiently large $n$, an $(n, M, \epsilon)$ channel code with rate

$$
\frac{\log M}{n} > R - \epsilon.
$$

The maximum $\epsilon$-achievable rate is called the $\epsilon$-capacity, denoted by $C_{FB}^{noise}(\epsilon)$. The channel capacity $C_{FB}^{noise}$ is defined as the maximal rate that is $\epsilon$-achievable for all $0 < \epsilon < 1$. Clearly, $C_{FB}^{noise} = \lim_{\epsilon \to 0} C_{FB}^{noise}(\epsilon)$.

The channel coding problem is to search for a sequence of $(n, M, \epsilon)$ channel codes under which the achievable rate is maximized as $n$ goes to $\infty$. In order to construct a general channel coding theorem (i.e. one with no restrictions on channels and input/output alphabets, such as stationarity or ergodicity), we introduce the following two probabilistic limit operations [20].

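Definition 17: (Probabilistic Limit) The limit superior in probability for any sequence $(X_1, X_2, \cdots)$ is defined by

$$
p\text{-}\limsup_{n\to\infty} X_n = \inf\{\alpha \,|\, \lim_{n\to\infty} \mathrm{Prob}\{X_n > \alpha\} = 0\}.
$$

Similarly, the limit inferior in probability for any sequence $(X_1, X_2, \cdots)$ is defined by

$$
p\text{-}\liminf_{n\to\infty} X_n = \sup\{\beta \,|\, \lim_{n\to\infty} \mathrm{Prob}\{X_n < \beta\} = 0\}.
$$

Next, we introduce some notation.

$$
\begin{aligned}
\underline{I}(X;Y) &= p\text{-}\liminf_{n\to\infty} \frac{1}{n}\, i(X^n;Y^n) \\
\bar{I}(X;Y) &= p\text{-}\limsup_{n\to\infty} \frac{1}{n}\, i(X^n;Y^n) \\
\underline{I}^R(X(F) \to Y) &= p\text{-}\liminf_{n\to\infty} \frac{1}{n}\, i^R(X^n(F^n) \to Y^n) \\
\bar{I}^R(X(F) \to Y) &= p\text{-}\limsup_{n\to\infty} \frac{1}{n}\, i^R(X^n(F^n) \to Y^n)
\end{aligned}
$$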
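As an informal illustration of these probabilistic limits, consider the simplest case of a memoryless binary symmetric channel with crossover probability $\alpha$ and i.i.d. uniform inputs (an assumption made only for this sketch; no feedback or code-functions are involved). The normalized information density $\frac{1}{n} i(X^n;Y^n)$ then concentrates around $1 - H(\alpha)$ as $n$ grows, so its limit inferior and limit superior in probability coincide. The helper names below are hypothetical.

```python
import math
import random


def binary_entropy(a):
    """Binary entropy H(a) in bits."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)


def normalized_density_sample(n, alpha, rng):
    """One sample of (1/n) i(X^n; Y^n) in bits for a BSC(alpha), uniform inputs.

    With uniform inputs p(y_i) = 1/2, so each symbol contributes
    log2(p(y_i|x_i) / p(y_i)) = log2(2 * alpha) if the bit was flipped
    and log2(2 * (1 - alpha)) otherwise.
    """
    total = 0.0
    for _ in range(n):
        flipped = rng.random() < alpha
        total += math.log2(2 * (alpha if flipped else 1 - alpha))
    return total / n


def density_range(n, alpha, trials=200, seed=0):
    """Smallest and largest sampled value of (1/n) i(X^n; Y^n)."""
    rng = random.Random(seed)
    samples = [normalized_density_sample(n, alpha, rng) for _ in range(trials)]
    return min(samples), max(samples)


if __name__ == "__main__":
    alpha = 0.1
    print("1 - H(alpha) =", 1 - binary_entropy(alpha))
    for n in (10, 100, 2000):
        print(n, density_range(n, alpha))
```

The printed range shrinks toward $1 - H(\alpha)$ as $n$ increases. For general channels with memory or feedback the two probabilistic limits need not coincide, which is why both are defined.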

As done in [9], it is convenient to consider the noisy feedback channel problem as a regular nonfeedback problem from the input alphabet $\mathcal{F}$ to the output alphabet $\mathcal{Y}$, as shown in Fig. 6. This viewpoint provides us with an approach to prove the channel coding theorem for channels with noisy feedback. Recall that the capacity of nonfeedback channels is characterized as follows.

Theorem 3: (Non-feedback Channel Capacity) For any channel with arbitrary input and output alphabets $\mathcal{F}$ and $\mathcal{Y}$, the channel capacity $C$ is given by

$$
C = \sup_F \underline{I}(F;Y)
$$

where $\sup_F$ denotes the supremum with respect to all the input processes $F$.

However, before applying the above result, we need to understand the inherent connection between the equivalent nonfeedback channel and the original channel with a noisy feedback link. Moreover, as taking the supremum of the mutual information over code-functions $F$ is inconvenient, we need to create a connection between the nonfeedback channel input distribution $\{p(f^n)\}$ and the original channel input distribution such that we can still work on the original channel input. These two issues are the main technical steps toward the channel coding theorem. We provide these results as lemmas in the next subsection. Then, we prove the channel coding theorem along the lines of the proof of Theorem 3.

A. Technical Lemmas 

We first show an equality of information densities between the nonfeedback channel $\mathcal{F}^n \to \mathcal{Y}^n$ and the original channel $\mathcal{X}^n \to \mathcal{Y}^n$.

Lemma 2:

$$
i(F^n;Y^n) = i^R(X^n(F^n) \to Y^n)
$$

where $i^R(X^n(F^n) \to Y^n)$ is defined as

$$
i^R(X^n(F^n) \to Y^n) = i(X^n \to Y^n) - i(X^n \to Y^n \| F^n).
$$
Proof:

$$
\begin{aligned}
i(F^n;Y^n) &= \log \frac{p(F^n,Y^n)}{p(F^n)p(Y^n)} \\
&\overset{(a)}{=} \log \frac{\prod_{i=1}^n p(Y_i|F^i,Y^{i-1})\,p(F_i|F^{i-1})}{p(F^n)p(Y^n)} \\
&= \log \frac{\prod_{i=1}^n p(Y_i|F^i,Y^{i-1})}{p(Y^n)} \\
&= \log \frac{\prod_{i=1}^n p(Y_i|X^i,Y^{i-1})}{p(Y^n)} - \log \frac{\prod_{i=1}^n p(Y_i|X^i,Y^{i-1})}{\prod_{i=1}^n p(Y_i|Y^{i-1},F^i)} \\
&\overset{(b)}{=} \log \frac{\prod_{i=1}^n p(Y_i|X^i,Y^{i-1})}{p(Y^n)} - \log \frac{\prod_{i=1}^n p(Y_i|F^i,X^i,Y^{i-1})}{\prod_{i=1}^n p(Y_i|Y^{i-1},F^i)} \\
&= i(X^n \to Y^n) - i(X^n \to Y^n \| F^n) \\
&= i^R(X^n(F^n) \to Y^n)
\end{aligned}
$$

where (a) follows from the fact that no feedback exists from $\mathcal{Y}$ to $\mathcal{F}$. Line (b) follows from the Markov chain $F^i - (X^i, Y^{i-1}) - Y_i$. ■

In the next lemma, we show that there exists a suitable construction of $p(f^n)$ such that the induced channel input distribution equals the original channel input distribution. As we will see, this result allows us to work with channel input distributions instead of code-function distributions.

Lemma 3: Given a channel $\{p(y_i|x^i, y^{i-1})\}_{i=1}^n$, a feedback link $\{p(z_i|y^i, z^{i-1})\}_{i=1}^n$, a channel input distribution $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^n$ and a sequence of code-function distributions $\{p(f_i|f^{i-1})\}_{i=1}^n$, the induced channel input distribution $\{p_{ind}(x_i|x^{i-1}, z^{i-1})\}_{i=1}^n$ (induced by $\{p(f_i|f^{i-1})\}_{i=1}^n$) equals the original channel input distribution $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^n$ if and only if the sequence of code-function distributions $\{p(f_i|f^{i-1})\}_{i=1}^n$ is good with respect to $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^n$. One choice of such a sequence of code-function distributions is as follows,

$$
p(f_i|f^{i-1}) = \prod_{z^{i-1}} p\big(f_i(z^{i-1})\,\big|\,f^{i-1}(z^{i-2}), z^{i-1}\big). \qquad (7)
$$
We refer the readers to Definition 5.1, Lemma 5.1 and Lemma 5.4 in [9] for the concept "good with respect to" and the proof of the above lemma. According to Lemma 3, it is straightforward to obtain the following result, which plays an essential role in the channel coding theorem.

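Lemma 4: For channels with noisy feedback,

$$
p(x^n,y^n,f^n) = \underbrace{\prod_{i=1}^n \prod_{z^{i-1}} p\big(f_i(z^{i-1})\,\big|\,f^{i-1}(z^{i-2}), z^{i-1}\big)}_{\text{Encoding}} \sum_{z^n \in \{\mathcal{Z}^n :\, x^n = f^n(z^{n-1})\}} \prod_{i=1}^n \underbrace{p(z_i|y^i, z^{i-1})}_{\text{Feedback link}}\; \underbrace{p\big(y_i\,\big|\,f^i(z^{i-1}), y^{i-1}\big)}_{\text{Channel}}
$$

The proof is given in the Appendix. This lemma implies that $\underline{I}^R(X(F) \to Y)$ depends only on the channel input distribution $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^\infty$.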

B. Channel Coding Theorem 

Now we show a general channel coding theorem in terms of the residual directed information.

Theorem 4: (Channel Coding Theorem) For channels with noisy feedback,

$$
C_{FB}^{noise} = \sup_X \underline{I}^R(X(F) \to Y) \qquad (8)
$$

where $\sup_X$ means that the supremum is taken over all possible channel input distributions $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^\infty$.

The proof follows the lines of the proof of Theorem 3 in [26] and is therefore presented in the Appendix. Theorem 4 indicates that, besides capturing the effective information flow of channels with noisy feedback, the residual directed information is also useful for characterizing the capacity. Although formula (8) may not be the only or the simplest characterization of the noisy feedback capacity, it provides benefits in several respects. We present two of them as follows.

1) Measurements of Information Flows: Let $p^*$ be the optimal solution of formula (8). Then, when the channel is used at capacity, the total transmission rate in the forward channel is in fact $\underline{I}(X \to Y)$ instead of $C_{FB}^{noise}$, and the difference between them (i.e. the redundant transmission rate) is $\underline{I}(X \to Y|F)|_{p^*}$ $^5$. Such numerical knowledge might be crucial in system design and evaluation.

$^5$ $\underline{I}(X \to Y|F)|_{p^*}$ denotes that the value is evaluated at the channel input distribution $p^*$.

2) Induced Computable Bounds: Let $q^* = \arg\sup_X \underline{I}(X \to Y)$, where the supremum is taken over all possible channel input distributions $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^\infty$. Since the code-function $F$ is not involved at this point, the computational complexity is significantly reduced. Based on Theorem 4, it is straightforward to obtain $\underline{I}(X \to Y)|_{q^*}$ and $\underline{I}^R(X(F) \to Y)|_{q^*}$ as upper$^6$ and lower bounds on the capacity, respectively. Further, the gap between the bounds is $\underline{I}(X \to Y|F)|_{q^*}$, which directly measures the tightness of the bounds.

C. Computable Bounds on the Capacity 

As shown above, the capacity characterization in Theorem 4 is not computable in general due to the probabilistic limit and the code-functions. This motivates us to explore conditions under which the previous characterization can be simplified, or to look at computable bounds instead. Toward this end, we first introduce a strong converse theorem under which the "probabilistic limit" can be replaced by the "normal limit". We then characterize a pair of upper and lower bounds which are much easier to compute and are tight in certain practical situations.

Definition 18: (Strong Converse) A channel with noisy feedback capacity $C_{FB}^{noise}$ has a strong converse if, for any $R > C_{FB}^{noise}$, every sequence of channel codes $\{(n, M_n, \epsilon_n)\}_{n=1}^\infty$ with

$$
\liminf_{n\to\infty} \frac{1}{n} \log M_n \ge R
$$

satisfies $\lim_{n\to\infty} \epsilon_n = 1$.

Theorem 5: (Strong Converse Theorem) A channel with noisy feedback capacity $C_{FB}^{noise}$ satisfies the strong converse property if and only if

$$
\sup_X \underline{I}^R(X(F) \to Y) = \sup_X \bar{I}^R(X(F) \to Y)\,^7 \qquad (9)
$$

Furthermore, if the strong converse property holds, we have

$$
C_{FB}^{noise} = \sup_X \lim_{n\to\infty} \frac{1}{n} I^R(X^n(F^n) \to Y^n).
$$

$^6$ Note that $\underline{I}(X \to Y)|_{q^*} = \sup_{\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^\infty} \underline{I}(X \to Y) \le C_{FB} = \sup_{\{p(x_i|x^{i-1}, y^{i-1})\}_{i=1}^\infty} \underline{I}(X \to Y)$, where $C_{FB}$ is the corresponding perfect feedback capacity. Therefore this upper bound is in general better than $C_{FB}$.

$^7$ This condition can alternatively be expressed as $\sup_X \underline{I}(F;Y) = \sup_X \bar{I}(F;Y)$. Since it is unclear which of the mutual information and the residual directed information is easier to compute, either condition may be used for the check. Note that how to check the strong converse is beyond the scope of this paper.
The proof follows directly from Chapter 3.5 in [20] by replacing $i(F^n;Y^n)$ with $i^R(X^n(F^n) \to Y^n)$. This theorem tells us that, for channels satisfying the strong converse property, we may compute the noisy feedback capacity by taking the normal limit instead of the probabilistic limit. How to further simplify the capacity characterization will be explored in the future.

We next propose a computable upper bound on the noisy feedback capacity.

Theorem 6: (Upper Bound)$^8$

$$
\bar{C}_{FB}^{noise} = \sup_X \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n \| Z^n) \qquad (10)
$$

where $\bar{C}_{FB}^{noise}$ denotes the upper bound on the capacity and the supremum is taken over all possible channel input distributions $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^\infty$.

Remark 10: The computational complexity of formula (10), which is independent of code-functions, is significantly reduced and is similar to that of directed information. We conjecture that most of the algorithms for computing the directed information may apply to computing formula (10). For example, for finite-state machine channels [27] with noisy feedback, formula (10) may be computable by a dynamic programming approach along the lines of

ED. 

We need the following lemma before showing the proof of Theorem [6] 
Lemma 5: 

$$
I(F^n;Y^n) = I^R(X^n(F^n) \to Y^n) = I(X^n \to Y^n \| Z^n) - I(F^n;Z^n|Y^n)
$$

$^8$ As we will see from the proof, this upper bound holds for any finite-alphabet channel with or without the strong converse property.
Proof:

$$
\begin{aligned}
I^R(X^n(F^n) \to Y^n) &\overset{(a)}{=} I(F^n;Y^n) \\
&= I(F^n;(Y^n,Z^n)) - I(F^n;Z^n|Y^n) \\
&\overset{(b)}{=} I(F^n \to (Y^n,Z^n)) - I(F^n;Z^n|Y^n) \\
&= \sum_{i=1}^n I(F^i;(Y_i,Z_i)|Y^{i-1},Z^{i-1}) - I(F^n;Z^n|Y^n) \\
&= \sum_{i=1}^n \big[ H(Y_i,Z_i|Y^{i-1},Z^{i-1}) - H(Y_i,Z_i|Y^{i-1},Z^{i-1},F^i) \big] - I(F^n;Z^n|Y^n) \\
&= \sum_{i=1}^n \big[ H(Z_i|Y^i,Z^{i-1}) + H(Y_i|Y^{i-1},Z^{i-1}) - H(Z_i|Y^i,Z^{i-1},F^i) - H(Y_i|Y^{i-1},Z^{i-1},F^i) \big] - I(F^n;Z^n|Y^n) \\
&\overset{(c)}{=} \sum_{i=1}^n \big[ H(Y_i|Y^{i-1},Z^{i-1}) - H(Y_i|Y^{i-1},Z^{i-1},F^i) \big] - I(F^n;Z^n|Y^n) \\
&\overset{(d)}{=} \sum_{i=1}^n \big[ H(Y_i|Y^{i-1},Z^{i-1}) - H(Y_i|Y^{i-1},X^i,Z^{i-1},F^i) \big] - I(F^n;Z^n|Y^n) \\
&\overset{(e)}{=} \sum_{i=1}^n \big[ H(Y_i|Y^{i-1},Z^{i-1}) - H(Y_i|Y^{i-1},X^i,Z^{i-1}) \big] - I(F^n;Z^n|Y^n) \\
&= \sum_{i=1}^n I(X^i;Y_i|Y^{i-1},Z^{i-1}) - I(F^n;Z^n|Y^n) \\
&= I(X^n \to Y^n \| Z^n) - I(F^n;Z^n|Y^n)
\end{aligned}
$$

where (a) follows from Lemma 2. Line (b) follows from the fact that there exists no feedback from $(Y^n, Z^n)$ to $F^n$, and thus the mutual information and the directed information coincide. Line (c) follows from the fact that $H(Z_i|Y^i,Z^{i-1}) = H(Z_i|Y^i,Z^{i-1},F^i)$ since $F^i - (Y^i, Z^{i-1}) - Z_i$ forms a Markov chain. Line (d) follows from the fact that $X^i$ can be determined by $F^i$ and the outputs of the feedback link $Z^{i-1}$. Line (e) follows from the Markov chain $F^i - (Y^{i-1}, X^i, Z^{i-1}) - Y_i$. ■

Now we present the proof of Theorem [6] as follows. 
Proof: Recall Lemma A1 in [28]: we have $\underline{I}(F;Y) \le \liminf_{n\to\infty} \frac{1}{n} I(F^n;Y^n)$ for any sequence of joint probabilities. That is, $\underline{I}^R(X(F) \to Y) \le \liminf_{n\to\infty} \frac{1}{n} I^R(X^n(F^n) \to Y^n)$. Then by Lemma 5,

$$
\begin{aligned}
C_{FB}^{noise} &\le \sup_X \liminf_{n\to\infty} \frac{1}{n} I^R(X^n(F^n) \to Y^n) \\
&= \sup_X \liminf_{n\to\infty} \frac{1}{n} \big( I(X^n \to Y^n \| Z^n) - I(F^n;Z^n|Y^n) \big) \qquad (11) \\
&\le \sup_X \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n \| Z^n)
\end{aligned}
$$

■

Corollary 3: Assume that there is an independent additive noise feedback (Fig. 3). Then

$$
\bar{C}_{FB}^{noise} = \sup_X \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n | V^n)
$$

where $\sup_X$ means that the supremum is taken over all possible channel input distributions $\{p(x_i|x^{i-1}, y^{i-1} + v^{i-1})\}_{i=1}^\infty$.



Proof:

$$
\begin{aligned}
I(X^n \to Y^n \| Z^n) &= \sum_{i=1}^n I(X^i;Y_i|Y^{i-1},Z^{i-1}) \\
&= \sum_{i=1}^n I(X^i;Y_i|Y^{i-1},V^{i-1}) \\
&= I(X^n \to Y^n \| V^n) \\
&\overset{(a)}{=} I(X^n \to Y^n | V^n)
\end{aligned}
$$

where (a) follows from remark Q] The proof is complete. ■ 
Next, we show a lower bound on the capacity for strong converse channels with additive noise feedback. In fact, any particular coding scheme may induce a lower bound on the noisy feedback capacity. However, the lower bound proposed in the following has useful features of its own.

Theorem 7: (Lower Bound) Assume that a channel with an independent additive noise feedback (Fig. 3) satisfies the strong converse property. A lower bound on the noisy feedback capacity is given by

$$
\underline{C}_{FB}^{noise} = \bar{C}_{FB}^{noise} - \bar{h}(V)
$$

where

$$
\bar{h}(V) = \limsup_{n\to\infty} \frac{1}{n} H(V^n).
$$
Proof: We need to show that, for any $\delta > 0$, there exists a sequence of $(n, M, \epsilon_n)$ channel codes ($\epsilon_n \to 0$ as $n \to \infty$) with transmission rate

$$
\begin{aligned}
R &= \bar{C}_{FB}^{noise} - \bar{h}(V) - \delta \\
&= \sup_X \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n | V^n) - \bar{h}(V) - \delta.
\end{aligned}
$$

Now, for any fixed $\delta > 0$, we take $\xi$ satisfying $0 < \xi < \delta$ and let $X_\xi$ be a sequence of channel input distributions $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^\infty$ satisfying

$$
\left( \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n \| Z^n) \right)\Big|_{X = X_\xi} = \sup_X \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n \| Z^n) - \xi \qquad (12)
$$

where $\left( \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n \| Z^n) \right)|_{X = X_\xi}$ denotes that $\liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n \| Z^n)$ is evaluated at $X = X_\xi$. According to the definition of the supremum, the existence of $X_\xi$ is guaranteed. Since for strong converse channels we have

$$
C_{FB}^{noise} = \sup_X \lim_{n\to\infty} \frac{1}{n} I^R(X^n(F^n) \to Y^n),
$$

we know that, for any $\delta > 0$, there exists a sequence of $(n, M, \epsilon_n)$ channel codes ($\epsilon_n \to 0$ as $n \to \infty$) with transmission rate

$$
R = \left( \lim_{n\to\infty} \frac{1}{n} I^R(X^n(F^n) \to Y^n) \right)\Big|_{X = X_\xi} - (\delta - \xi).
$$
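By Lemma 5,

$$
\begin{aligned}
R &= \left( \lim_{n\to\infty} \frac{1}{n} \big( I(X^n \to Y^n \| Z^n) - I(F^n;Z^n|Y^n) \big) \right)\Big|_{X = X_\xi} - (\delta - \xi) \\
&= \left( \lim_{n\to\infty} \frac{1}{n} \big( I(X^n \to Y^n \| Z^n) - H(Z^n|Y^n) + H(Z^n|Y^n,F^n) \big) \right)\Big|_{X = X_\xi} - (\delta - \xi) \\
&\ge \left( \liminf_{n\to\infty} \frac{1}{n} \big( I(X^n \to Y^n \| Z^n) - H(Z^n|Y^n) \big) \right)\Big|_{X = X_\xi} - (\delta - \xi) \\
&= \left( \liminf_{n\to\infty} \frac{1}{n} \Big( I(X^n \to Y^n \| Z^n) - \sum_{i=1}^n H(Z_i|Z^{i-1},Y^n) \Big) \right)\Big|_{X = X_\xi} - (\delta - \xi) \\
&\ge \left( \liminf_{n\to\infty} \frac{1}{n} \Big( I(X^n \to Y^n \| Z^n) - \sum_{i=1}^n H(Z_i|Z^{i-1},Y^i) \Big) \right)\Big|_{X = X_\xi} - (\delta - \xi) \\
&\overset{(a)}{=} \left( \liminf_{n\to\infty} \frac{1}{n} \Big( I(X^n \to Y^n \| Z^n) - \sum_{i=1}^n H(V_i|V^{i-1}) \Big) \right)\Big|_{X = X_\xi} - (\delta - \xi) \\
&= \left( \liminf_{n\to\infty} \frac{1}{n} \big( I(X^n \to Y^n \| Z^n) - H(V^n) \big) \right)\Big|_{X = X_\xi} - (\delta - \xi) \\
&\ge \left( \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n \| Z^n) \right)\Big|_{X = X_\xi} - \limsup_{n\to\infty} \frac{1}{n} H(V^n) - (\delta - \xi) \\
&\overset{(b)}{=} \sup_X \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n \| Z^n) - \xi - \bar{h}(V) - (\delta - \xi) \\
&= \sup_X \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n \| Z^n) - \bar{h}(V) - \delta \\
&\overset{(c)}{=} \sup_X \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n | V^n) - \bar{h}(V) - \delta
\end{aligned}
$$

where (a) follows from the fact that $Z_i = Y_i + V_i$ and the Markov chain $Y^i - V^{i-1} - V_i$. Line (b) follows from equation (12). Line (c) follows from Corollary 3.

Since $\delta$ can be arbitrarily small, the proof is complete. ■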
Remark 11: This theorem reveals that the gap between the proposed upper and lower bounds depends only on the additive feedback noise $V$ (i.e. it is independent of the forward channel). Further, if the entropy rate of the noise $V$ goes to zero$^9$, the proposed upper and lower bounds coincide and thus the capacity is known.

$^9$ In many practical situations, the entropy rate of the feedback noise is small. For example, if the feedback link only suffers intersymbol interference as illustrated in Chapter 4 of [29], the entropy rate turns out to be approximately 0.0808. Further, if the cardinality of $V^\infty$ is finite (yet the feedback is still noisy), the entropy rate is clearly zero.
We end this section by investigating two examples of noisy feedback channels.

Example 3: This example shows that, for a DMC with noisy feedback, the characterized upper bound equals the open-loop capacity. This implies that the upper bound should be tight when the channel "converges" to a DMC. In addition, this example verifies the result of Section IV (i.e. Theorem 2).

Consider a binary symmetric channel (BSC) with binary symmetric feedback. Note that this is the simplest model of a noisy feedback channel, yet it captures most features of the general problem. We model the noisy channel/feedback as an additive noise channel/feedback as follows.

$$
Y_i = X_i + U_i \pmod 2 \quad \text{and} \quad Z_i = Y_i + V_i \pmod 2
$$

where we assume that $\Pr(U_i = 1) = 1 - \Pr(U_i = 0) = \alpha$ and $\Pr(V_i = 1) = 1 - \Pr(V_i = 0) = \beta$. It is known that the capacity of this noisy feedback channel equals the nonfeedback capacity $1 - H(\alpha)$, where $H(\alpha) = -\alpha \log \alpha - (1-\alpha)\log(1-\alpha)$. Next, we show that maximizing the conditional directed information in Corollary 3 provides the noisy feedback capacity. That is,

$$
\sup_X \liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n | V^n) = 1 - H(\alpha).
$$

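This can be done as follows.

$$
\begin{aligned}
\liminf_{n\to\infty} \frac{1}{n} I(X^n \to Y^n | V^n) &= \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n I(X^i;Y_i|Y^{i-1},V^{i-1}) \\
&= \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \big[ H(Y_i|Y^{i-1},V^{i-1}) - H(Y_i|X^i,Y^{i-1},V^{i-1}) \big] \\
&= \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \big[ H(Y_i|Y^{i-1},V^{i-1}) - H(Y_i|X_i) \big] \\
&= \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \big[ H(Y_i|Y^{i-1},V^{i-1}) - H(U_i) \big] \\
&\overset{(a)}{\le} \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \big[ H(Y_i|Y^{i-1}) - H(U_i) \big] \\
&\overset{(b)}{\le} \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \big[ H(Y_i) - H(U_i) \big] \\
&\overset{(c)}{\le} 1 - H(\alpha)
\end{aligned}
$$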

where taking equality in (a) implies $\liminf_{n\to\infty} \frac{1}{n} I(V^{n-1};Y^n) = 0$, that is, the capacity-achieving encoder should be non-typical. This verifies Theorem 2 in Section IV. Taking equalities in (b) and (c) implies that the capacity-achieving encoder should produce equal-probability channel outputs (i.e. a uniform distribution). Clearly, there exists such an optimal encoder for which all the above equalities hold.

Fig. 7: The upper bound on the capacity of a first moving average Gaussian channel with AWGN feedback. (Plot legend: noisy feedback capacity, perfect feedback capacity, non-feedback capacity.)
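For the BSC model of Example 3, the quantities involved are easy to evaluate numerically. The sketch below is an illustration under stated assumptions rather than part of the original derivation: it checks that a uniform input maximizes the single-letter mutual information (giving $1 - H(\alpha)$), and it evaluates the pair of bounds from Theorem 6 and Theorem 7 assuming their hypotheses hold, using the fact that an i.i.d. Bernoulli($\beta$) feedback noise has entropy rate $H(\beta)$. Clipping the lower bound at zero and all function names are choices made for this sketch.

```python
import math


def binary_entropy(a):
    """Binary entropy H(a) in bits."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)


def bsc_mutual_information(px1, alpha):
    """I(X;Y) = H(Y) - H(U) for a BSC(alpha) with Pr(X = 1) = px1."""
    py1 = px1 * (1 - alpha) + (1 - px1) * alpha
    return binary_entropy(py1) - binary_entropy(alpha)


def example3_bounds(alpha, beta):
    """(lower, upper) bounds for the BSC with i.i.d. Bernoulli(beta) feedback noise.

    upper: 1 - H(alpha), the value of the upper bound computed in Example 3.
    lower: upper minus the entropy rate H(beta) of the feedback noise,
           clipped at zero since a rate cannot be negative (sketch assumption).
    """
    upper = 1 - binary_entropy(alpha)
    lower = max(0.0, upper - binary_entropy(beta))
    return lower, upper


if __name__ == "__main__":
    alpha = 0.1
    grid = [k / 100 for k in range(101)]
    best = max(grid, key=lambda q: bsc_mutual_information(q, alpha))
    print("maximizing Pr(X=1):", best)
    print("1 - H(alpha)      :", 1 - binary_entropy(alpha))
    for beta in (0.0, 0.01, 0.1):
        print("beta =", beta, "bounds =", example3_bounds(alpha, beta))
```

In this example the true capacity equals the upper bound, so the lower bound is loose unless $\beta$ is small. The gap $H(\beta)$ depends only on the feedback noise, in line with Remark 11.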

Example 4: In this example, we consider a colored Gaussian channel with additive white Gaussian noise feedback and compute the proposed upper bound$^{10}$. Specifically, we assume the forward channel and the feedback link are as follows.

$$
Y_i = X_i + W_i \quad \text{and} \quad Z_i = Y_i + V_i
$$

where $W_i = U_i + 0.1U_{i-1}$, $U_i$ is a white Gaussian process with zero mean and unit variance, and $V_i$ is a white Gaussian process with zero mean and variance $\sigma$. We take coding block length $n = 30$ and power limit $P = 10$
for computing the upper bound. See Fig. [7] We refer the interested readers to OTI . Il30ll for the details of the 
computation and discussions. From the plot of the upper bound, we see that the noisy feedback capacity is very 
sensitive to the feedback noise, at least for certain Gaussian channels. 

$^{10}$ Although Gaussian channels are not finite-alphabet, the upper bound characterization still holds. The derivation of the upper bound follows exactly the same idea as in this paper and can be found in [30].
VI. Conclusion



We proposed a new concept, the residual directed information, for characterizing the effective information flow through communication channels with noisy feedback; it extends Massey's concept of directed information. Based on this new concept, we first analyzed the information flow in noisy feedback channels and then showed that the capacity of a DMC is not achievable by using any typical closed-loop encoder. We next proved a general channel coding theorem in terms of the proposed residual directed information. Finally, we proposed computable bounds characterized by the causal conditional directed information.

The results in this paper open up new directions for investigating the role of noisy feedback in communication systems. Furthermore, the new definitions, concepts and methodologies presented in the paper have the potential to be extended to multiple access channels, broadcast channels or general multi-user channels with noisy feedback.



VII. Appendix 



A. Proof of Lemma 4



Before giving the proof, we need the following Lemma. 



Lemma 6: For channels with noisy feedback, as shown in Fig[U 



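$$
p(x^n,y^n) = \sum_{z^n \in \mathcal{Z}^n} \prod_{i=1}^n \underbrace{p(z_i|y^i, z^{i-1})}_{\text{Feedback link}}\; \underbrace{p(x_i|x^{i-1}, z^{i-1})}_{\text{Encoding}}\; \underbrace{p(y_i|x^i, y^{i-1})}_{\text{Channel}}
$$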
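Proof:

$$
\begin{aligned}
p(x^n,y^n) &= \sum_{z^n \in \mathcal{Z}^n} p(x^n,y^n,z^n) \\
&= \sum_{z^n \in \mathcal{Z}^n} p(z_n|x^n,y^n,z^{n-1})\,p(x^n,y^n,z^{n-1}) \\
&= \sum_{z^n \in \mathcal{Z}^n} p(z_n|x^n,y^n,z^{n-1})\,p(y_n|x^n,y^{n-1},z^{n-1})\,p(x^n,y^{n-1},z^{n-1}) \\
&= \sum_{z^n \in \mathcal{Z}^n} p(z_n|x^n,y^n,z^{n-1})\,p(y_n|x^n,y^{n-1},z^{n-1})\,p(x_n|x^{n-1},y^{n-1},z^{n-1})\,p(x^{n-1},y^{n-1},z^{n-1}) \\
&\overset{(a)}{=} \sum_{z^n \in \mathcal{Z}^n} p(z_n|y^n,z^{n-1})\,p(y_n|x^n,y^{n-1})\,p(x_n|x^{n-1},z^{n-1})\,p(x^{n-1},y^{n-1},z^{n-1}) \\
&= \sum_{z^n \in \mathcal{Z}^n} \prod_{i=1}^n p(z_i|y^i,z^{i-1})\,p(y_i|x^i,y^{i-1})\,p(x_i|x^{i-1},z^{i-1})
\end{aligned}
$$

where (a) follows from the Markov chains $x^n - (y^n, z^{n-1}) - z_n$, $z^{n-1} - (x^n, y^{n-1}) - y_n$ and $y^{n-1} - (x^{n-1}, z^{n-1}) - x_n$. The last equality follows by applying the same factorization recursively to $p(x^{n-1},y^{n-1},z^{n-1})$.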



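Now, we are ready to give the proof of Lemma 4.

Proof:

$$
\begin{aligned}
p(x^n,y^n,f^n) &= p(x^n,y^n|f^n)\,p(f^n) \\
&\overset{(a)}{=} p(f^n) \sum_{z^n \in \mathcal{Z}^n} \prod_{i=1}^n p(z_i|y^i,z^{i-1},f^n)\,p(y_i|x^i,y^{i-1},f^n)\,p(x_i|x^{i-1},z^{i-1},f^n) \\
&= p(f^n) \sum_{z^n \in \{\mathcal{Z}^n :\, x^n = f^n(z^{n-1})\}} \prod_{i=1}^n p(z_i|y^i,z^{i-1},f^n)\,p\big(y_i\,\big|\,f^i(z^{i-1}),y^{i-1},f^n\big) \\
&\overset{(b)}{=} p(f^n) \sum_{z^n \in \{\mathcal{Z}^n :\, x^n = f^n(z^{n-1})\}} \prod_{i=1}^n p(z_i|y^i,z^{i-1})\,p\big(y_i\,\big|\,f^i(z^{i-1}),y^{i-1}\big) \\
&\overset{(c)}{=} \prod_{i=1}^n \prod_{z^{i-1}} p\big(f_i(z^{i-1})\,\big|\,f^{i-1}(z^{i-2}),z^{i-1}\big) \sum_{z^n \in \{\mathcal{Z}^n :\, x^n = f^n(z^{n-1})\}} \prod_{i=1}^n p(z_i|y^i,z^{i-1})\,p\big(y_i\,\big|\,f^i(z^{i-1}),y^{i-1}\big)
\end{aligned}
$$

where (a) follows from Lemma 6. The unlabeled equality holds because, given $f^n$, the channel input is determined by $x_i = f_i(z^{i-1})$. Line (b) follows from the Markov chains $f^n - (y^i, z^{i-1}) - z_i$ and $f^n - (f^i(z^{i-1}), y^{i-1}) - y_i$. Line (c) follows from Lemma 3.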
B. Proof of Theorem 4

We now prove the channel coding theorem by combining the following converse theorem and achievability theorem.

a). Converse Theorem

The following is a generalization of Theorem 4 in [26], which gives a lower bound on the block error probability.

Lemma 7: Every $(n, M, \epsilon)$ channel code satisfies

$$
\epsilon \ge \mathrm{Prob}\left\{ \frac{1}{n} i^R(X^n(F^n) \to Y^n) \le \frac{1}{n} \log M - \gamma \right\} - 2^{-\gamma n}
$$

for every $\gamma > 0$.

Proof: We assume the disjointness of the decoding sets $D$, i.e. $D_{w=i} \cap D_{w=j} = \emptyset$ if $i \neq j$. Under this restriction on the decoder, [26] has shown that any $(n, M, \epsilon)$ channel code for the nonfeedback channel $\mathcal{F}^n \to \mathcal{Y}^n$ satisfies, for all $\gamma > 0$,

$$
\epsilon \ge \mathrm{Prob}\left\{ \frac{1}{n} i(F^n;Y^n) \le \frac{1}{n} \log M - \gamma \right\} - 2^{-\gamma n}.
$$

By Lemma 2, we have

$$
i(F^n;Y^n) = i^R(X^n(F^n) \to Y^n).
$$

The proof is complete. ■

Note that this lemma holds independently of the decoder that one uses. The only restriction on the decoder is the disjointness of the decoding regions.
Theorem 8: (Converse Theorem)

$$
C_{FB}^{noise} \le \sup_X \underline{I}^R(X(F) \to Y)
$$

Proof: Assume that there exists a sequence of $(n, M, \epsilon_n)$ channel codes with $\epsilon_n \to 0$ as $n \to \infty$ and with transmission rate

$$
R = \liminf_{n\to\infty} \frac{1}{n} \log M.
$$

By Lemma 7, we know that for all $\gamma > 0$,

$$
\epsilon_n \ge \mathrm{Prob}\left\{ \frac{1}{n} i^R(X^n(F^n) \to Y^n) \le \frac{1}{n} \log M - \gamma \right\} - 2^{-\gamma n}.
$$

As $n \to \infty$, the probability on the right-hand side must go to zero since $\epsilon_n \to 0$. By the definition of $\underline{I}^R(X(F) \to Y)$, we have

$$
\limsup_{n\to\infty} \frac{1}{n} \log M - \gamma \le \underline{I}^R(X(F) \to Y).
$$

Since $\gamma$ can be arbitrarily small, we have

$$
R \le \limsup_{n\to\infty} \frac{1}{n} \log M \le \underline{I}^R(X(F) \to Y) \le \sup_X \underline{I}^R(X(F) \to Y).
$$

The proof is complete. ■
b). Achievability Theorem 

The following is a generalization of Feinstein's lemma [32] based on the residual directed information.

Lemma 8: Fix a positive integer $n$, $0 < \epsilon < 1$, a channel $\{p(y_i|x^i, y^{i-1})\}_{i=1}^n$ and a feedback link $\{p(z_i|y^i, z^{i-1})\}_{i=1}^n$. For every $\gamma > 0$ and every channel input distribution $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^n$, there exists a channel code $(n, M, \epsilon)$ that satisfies

$$
\epsilon \le \mathrm{Prob}\left\{ \frac{1}{n} i^R(X^n(F^n) \to Y^n) \le \frac{1}{n} \log M + \gamma \right\} + 2^{-\gamma n}.
$$

Proof: Given a channel input distribution $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^n$, we generate a code-function distribution $\{p(f_i|f^{i-1})\}_{i=1}^n$ such that the induced channel input distribution equals the original channel input distribution. Such a code-function distribution exists according to Lemma 3. In [26], it has been shown that for a nonfeedback channel $\{p(y_i|f^i, y^{i-1})\}_{i=1}^n$, a channel input distribution $\{p(f_i|f^{i-1})\}_{i=1}^n$ and every $\gamma > 0$, there exists a channel code $(n, M, \epsilon)$ that satisfies

$$
\epsilon \le \mathrm{Prob}\left\{ \frac{1}{n} i(F^n;Y^n) \le \frac{1}{n} \log M + \gamma \right\} + 2^{-\gamma n}.
$$

Recall that this result is proved by a random coding argument. Then, by Lemma 2, we have

$$
i(F^n;Y^n) = i^R(X^n(F^n) \to Y^n).
$$

The proof is complete after this replacement. ■
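Theorem 9: (Achievability Theorem)

$$
C_{FB}^{noise} \ge \sup_X \underline{I}^R(X(F) \to Y)
$$

Proof: Fix an arbitrary $0 < \epsilon < 1$ and a channel input distribution $\{p(x_i|x^{i-1}, z^{i-1})\}_{i=1}^\infty$. We shall show that $\underline{I}^R(X(F) \to Y)$ is an $\epsilon$-achievable rate by demonstrating that, for every $\delta > 0$ and all sufficiently large $n$, there exists an $(n, M, 2^{-n\delta/4} + \frac{\epsilon}{2})$ code with rate

$$
\underline{I}^R(X(F) \to Y) - \delta < \frac{\log M}{n} < \underline{I}^R(X(F) \to Y) - \frac{\delta}{2}.
$$

If, in Lemma 8, we choose $\gamma = \frac{\delta}{4}$, then the right-hand side value in Lemma 8 becomes

$$
\begin{aligned}
&\mathrm{Prob}\left\{ \frac{1}{n} i^R(X^n(F^n) \to Y^n) \le \frac{1}{n} \log M + \frac{\delta}{4} \right\} + 2^{-n\delta/4} \\
&\quad \le \mathrm{Prob}\left\{ \frac{1}{n} i^R(X^n(F^n) \to Y^n) \le \underline{I}^R(X(F) \to Y) - \frac{\delta}{4} \right\} + 2^{-n\delta/4} \\
&\quad \le \frac{\epsilon}{2} + 2^{-n\delta/4}
\end{aligned}
$$

where the second inequality holds for all sufficiently large $n$ because of the definition of $\underline{I}^R(X(F) \to Y)$. Therefore, $\underline{I}^R(X(F) \to Y)$ is an $\epsilon$-achievable rate. ■

The proof of Theorem 4 is obtained by combining Theorem 8 and Theorem 9.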